Homogeneous bipartition based on multidimensional ranking

Author

  • Michaël Aupetit
Abstract

We present an algorithm that partitions a data set into two parts of equal size and, experimentally, nearly the same distribution, as measured through the likelihood of a Parzen kernel density estimator. Generating the partition takes O(½N(N − 1)) operations (N being the number of data points) and is two orders of magnitude faster than the state of the art.

1 Generating an equi-distributed bipartition

1.1 Problem and applications

We consider the problem of generating a homogeneous partition of a data set, that is, a partition such that each part has the same distribution as the whole. We focus here on bipartitions, which are partitions containing two parts of equal size (the number N of data points is even) (Figure 1a). This problem has been studied under the name "data squashing" [2], to summarize massive data sets in a way that preserves statistical relationships among variables better than random sampling; in biostatistics, to select patients for treatment and control groups [5]; and in Machine Learning, as a way to improve the estimation of model complexity [7].

1.2 Inspiration from two-sample tests

To check whether two groups are drawn from the same distribution, one can use a two-sample test. Suppose a blue-red bipartition of the data. Several two-sample tests are based on building a proximity graph of the data and counting the number of mixed edges, i.e. edges with one red and one blue vertex. The higher this number, the higher the probability that the red and blue data are drawn from the same distribution. The null hypothesis that both parts are homogeneous is then rejected if the number of mixed edges is too small. Friedman and Rafsky [3] proposed a test based on the minimal spanning tree. Recently, Rosenbaum proposed the "cross-matching" test [6], which uses a minimal distance non-bipartite matching [4]. The homogeneous bipartitioning problem can thus be reduced to building a bipartition of the data that passes a multivariate two-sample test.
One way to do this is to generate several random bipartitions and keep the one

A proximity graph connects two points if they are close to each other with respect to some measure of closeness. Examples of such graphs are the minimum spanning tree, the nearest neighbor graph, and the Delaunay graph.
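
The mixed-edge count behind the graph-based two-sample tests mentioned above can be illustrated with a minimal sketch. It assumes Euclidean distance and uses a minimum spanning tree as the proximity graph, in the spirit of the Friedman and Rafsky test; the function names `minimum_spanning_tree` and `count_mixed_edges` are hypothetical, and this is not the partitioning algorithm of the paper, only the statistic such a test would look at.

```python
import math


def minimum_spanning_tree(points):
    """Return the MST edges (i, j) of the complete Euclidean graph (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from each vertex to the tree
    best_from = [-1] * n        # tree vertex that realises that link
    best_dist[0] = 0.0          # start growing the tree from vertex 0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest links of the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def count_mixed_edges(points, labels):
    """Count MST edges whose two endpoints carry different labels."""
    return sum(labels[i] != labels[j]
               for i, j in minimum_spanning_tree(points))


# Four points on a line; the MST is the chain 0-1-2-3.
pts = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(count_mixed_edges(pts, [0, 1, 0, 1]))  # interleaved parts: 3 mixed edges
print(count_mixed_edges(pts, [0, 0, 1, 1]))  # separated parts: 1 mixed edge
```

In this toy case the interleaved bipartition yields more mixed edges than the separated one, which matches the intuition that a higher count indicates more similar distributions. Note that this sketch evaluates all pairwise distances, so its cost grows quadratically with N.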

Similar resources

An extended multidimensional Hardy-Hilbert-type inequality with a general homogeneous kernel

In this paper, by the use of the weight coefficients, the transfer formula and the technique of real analysis, an extended multidimensional Hardy-Hilbert-type inequality with a general homogeneous kernel and a best possible constant factor is given. Moreover, the equivalent forms, the operator expressions and a few examples are considered.

Ranking fMRI time courses by minimum spanning trees: assessing coactivation in fMRI.

In fMRI, time courses with similar temporal "activation" patterns may belong to different brain regions (i.e., these regions are functionally connected, coactivated). A group of time courses (TCs) corresponding to a particular type of temporal activation pattern should be maximally self-consistent (homogeneous). We demonstrate that ordering a group of multidimensional fMRI time courses by a min...

The New Method for Ranking Grouped Credit Customer Based on DEA Method

Data Envelopment Analysis (DEA) is a widely used non-parametric method for ranking by Decision-Making Units (DMU). Despite the fact that DEA method does not require numerous preconditions, the necessity of the DMUs to be homogeneous is one of the most important rules in applying this technique. Moreover, in real world problems, due to the nature of DMUs, the need for ranking the grouped data ha...

Choosing weights for a complete ranking of DMUs in DEA and cross-evaluation

Conventional data envelopment analysis (DEA) assists decision makers in distinguishing between efficient and inefficient decision making units (DMUs) in a homogeneous group. However, DEA does not provide more information about the efficient DMUs. One of the interesting research subjects is to discriminate between efficient DMUs. The aim of this paper is ranking all efficient (extreme and non-ex...

Case-Based Multilabel Ranking

We present a case-based approach to multilabel ranking, a recent extension of the well-known problem of multilabel classification. Roughly speaking, a multilabel ranking refines a multilabel classification in the sense that, while the latter only splits a predefined label set into relevant and irrelevant labels, the former furthermore puts the labels within both parts of this bipartition in a t...

Publication date: 2008